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Abstract 

Sparse feature selection has been demonstrated to be effective in handling high-dimensional data. 
While promising, most of the existing works use convex methods, which may be suboptimal in terms 
of the accuracy of feature selection and parameter estimation. In this paper, we expand a nonconvex 
paradigm to sparse group feature selection, which is motivated by applications that require identifying 
the underlying group structure and performing feature selection simultaneously. The main contributions 
of this article are twofold: (1) statistically, we introduce a nonconvex sparse group feature selection 
model which can reconstruct the oracle estimator. Therefore, consistent feature selection and parameter 
estimation can be achieved; (2) computationally, we propose an efficient algorithm that is applicable to 
large-scale problems. Numerical results suggest that the proposed nonconvex method compares favorably 
against its competitors on synthetic data and real-world applications, thus achieving desired goal of 
delivering high performance. 



1 Introduction 

During the past decade, sparse feature selection has been extensively investigated, on both optimization 
algorithms pQ and statistical properties [55J [SU1 13] • When the data possesses certain group structure, sparse 
modeling has been explored in [5H [TFJ H3] for group feature selection. The group lasso [23] proposes an 
I/2-regularization method for each group, which ultimately yields a group-wisely sparse model. The utility 
of such a method has been demonstrated in detecting splice sites j23j — an important step in gene finding 
and theoretically justified in [13]. The sparse group lasso [TT] enables to encourage sparsity at the level of 
both features and groups simultaneously. 

In the literature, most approaches use convex methods due to globality of the solution and tractable 
computation. However, this may lead to suboptimal results. Recent studies demonstrate that nonconvex 
methods, for instance, the truncated Li-penalty [T9J [T5j [27], may have potential to deliver superior perfor- 
mance than the standard Li-formulation. In addition, |19j suggests that a constrained version of nonconvex 
regularization is slightly more preferable than its regularization counterpart due to theoretical merits. 

In this article, we investigate the sparse group feature selection (SGFS) through a constrained nonconvex 
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formulation. Ideally, we wish to optimize the following Lo-model: 

minimize — \\Ax — 
x 2 

p 

subject to H\ x j\ 7^ 0) < si 

3 -=i (!) 

|G| 

$>(ii*Gj a ^o)<* a , 

i=i 

where A is an n by p data matrix with its columns representing different features, x = (xi, • • • ,x p ) is 
partitioned into |G| non-overlapping groups {#Gi} and /(•) is the indicator function. The advantage of the 
io-nrodcl (QJ lies in its complete control on two levels of sparsity (si, s%), which are the numbers of features 
and groups respectively. However, a problem like ([T]) is known to be NP-hard [17] . 

This paper develops an efficient nonconvex method, which is a computational surrogate of the Lo-method 
described above and has theoretically guaranteed performance. We contribute in two aspects: (i) statistically, 
the proposed method retains the merits of the Lq approach ([1} in the sense that the oracle estimator can 
be reconstructed, which leads to consistent feature selection and parameter estimation; (ii) computationally, 
our efficient optimization tool enables to treat large-scale problems. 



2 Nonconvex Formulation and Computation 

One major difficulty of solving ([1]) comes from nonconvex and discrete constraints, which require enumerating 
all possible combinations of features and groups to achieve the optimal solution. Therefore we approximate 
these constraints by their continuous computational surrogates: 

minimize — ll-Ax — 
x 2 

p 

subject to J T (\xj\) < si, 

3=1 W 

\G\ 

^Jr(||x Gi || 2 )<S 2l 
i=l 

where J T (z) — min(|,z|/T, I) is a truncated Li-function approximating the Lo-function [19, 26i, and r > is 
a tuning parameter such that J T {z) approximates the indicator function I(\z\ ^ 0) as r approaches zero. 

To solve the nonconvex problem we develop a Difference of Convex (DC) algorithm based on a 
decomposition of each nonconvex constraint function into a difference of two convex functions; for instance, 

p 

^ J T (\ Xj \) = Srix) - S 2 (x), 



3 = 1 



where 



I 

T * — ' 



and 

I P 

S 2 (x) = — maxljxjl — r, 0} 
T i=i 
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are convex in x. Then each trailing convex function, say S 2 (x), is replaced by its affine minorant at the 
previous iteration 

Sx(x)- S 2 {x {m -^)-VS 2 {x {m -V) T {x-x {m -V), (3) 
which yields an upper approximation of the constraint function J r (|xj|) as follows: 

\ J2 \xj\ ■ < r) +irm ( r 1} \ >r)<sr. (4) 

T 3=1 3 = 1 

Similarly, the second nonconvex constraint in ([2]) can be approximated by 

-£>cja • JGIx^lh < r) +£/(||47 1) || 2 > r) < fla . (5) 

T 3=1 3 = 1 

Note that both Q and © are convex constraints, which result in a convex subproblem as follows: 

minimize — — ylln 
x 2 

subject to illaj^^^lli^si-fp-mfi^- 1 ))!) (6) 

T 

-H^^Hg <s 2 - (\G\ - \T 2 (x^)\), 

T 

where T%, T 2 and T 3 are the support setifj] defined as: 

T 1 {x)={i: \x t \ <t} 

T 2 {x) = {i : \\x Gi h<r} 

T 3 (x) = {i : Xi G x Gj ,j G T 2 (x)}, 

||a; Tl ||i and ||a; T3 |jG denote the corresponding value restricted on T\ and T3 respectively, and \\x\\a = 
12i=i \\ x Gi Solving ([6]) would provide us an updated solution, denoted as x^ m \ Such procedure is 
iterated until the objective value is no longer decreasing, indicating that a local minimizer is achieved. The 
DC algorithm is summarized in Algorithm [1] from which we can see that efficient computation of ([5]) is 
critical to the overall DC routine. We defer detailed discussion of this part to Section 2] 

Algorithm 1 DC programming for solving ([2]) 
Input: A, y, s%, s 2 
Output: solution x to Q 
1: (Initialization) Initialize x^°\ 

2: (Iteration) At iteration m, compute x^ by optimizing ©. 

3: (Stopping Criterion) Terminate when the objective function stops decreasing. 



3 Theoretical Results 

This section investigates theoretical aspects of the proposed method. More specifically, we demonstrate that 
the oracle estimator x°, the least squares estimator based on the true model, can be reconstructed. As a 
result, consistent selection as well as optimal parameter estimation can be achieved. 

1 Support sets indicate that the elements outside these sets have no effect on the particular items in the constraints of Jot . 



3 



For better presentation, we introduce some notations that would be only utilized in this section. Let 
C = (Gi ± ,--- ,Gi k ) be the collection of groups that contain nonzero elements. Let Aq, = Aq. (x) and 
A = A(x) denote the indices of nonzero elements of x in group Gj and in entire x respectively. Define 

Sj,i = {xeS: (A C ,C) ± (A C0 ,C°), \A\ = j, \C\ = i}, 

where S is the feasible region of © and C° represents the true nonzero groups. 

The following assumptions are needed for obtaining consistent reconstruction of the oracle estimator: 
Assumption 1 (Separation condition). Define 

C ■ (x°) inf -l°g(l-fe 2 (g,g°)) 
mml max(|C°\C|,l) ' 

then for some constant c\ > 0, 

n 

where 

h(x, x°) = (lj (g^(x, y) - gV*(x°, y)fd^y)) 1/2 
is the Hellinger- distance for densities with respect to a dominating measure \x. 

Assumption 2 (Complexity of the parameter space). For some constants cq > and any < t < e < 1, 

H(t,Fj,i) < c max((log(|G| + a?)) 2 , 1)1^1 log(2e/i), 

where Bj t i = Sjj n {x G h(x,x°) < 2e} is a local parameter space and Tj.i — {g 1 ^ 2 (x,y) : x € is a 

collection of square-root densities. H{-,!F) is the bracketing Hellinger metric entropy of space T [Ufl - 

Assumption 3. For some positive constants di,d2,d 3 with d\ > 10, 

- log(l - h 2 (x, x )) > -d x log(l - h 2 (x T , x Q )) - d 3 T d2 p, 
where x T — (xil(\xi | > r), • • ■ , x p /(|a;p| > r)). 

With the above assumptions hold, we can conclude the following non-asymptotic probability error bound 
regarding the reconstruction of the oracle estimator x°. 

Theorem 1. Suppose that Assumptions^ and\3\hold. For a global minimizer of @ x with (s±, S2) = ^2) 
and t < ( ( rfl ~ 10 ^ ( ^° m ( a! ) ^ 1 / d2 ^ j H 0W i n g result hold: 



F[x ? x°j < cxp ^ - c 2 nC min (x°) + 2 (log \G\ + logs?) 

Moreover, with Assumption^ hold, ¥^x = x°^j — > 1 and 



Eh 2 (x, x°) = (1 + o(l)) max(Eh 2 (x°, x°), -i) 

n 

as n — > 00, \G\ — > 00. 

Theorem [1] states that the oracle estimator x° can be accurately reconstructed, which in turn yields 
feature selection consistency as well as the recovery of the performance of the oracle estimator in parameter 
estimation. Moreover, according to Assumption [T] such conclusion still holds when s < {\G\ grows in the order 
of exp(c 1 ~ 1 nC ram ) . This is in contrast to existing conclusions on consistent feature selection, where the 
number of candidate features should be no larger than exp(c*n) for some c* [28] . In this sense, the number 
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of candidate features is allowed to be much larger when an additional group structure is incorporated, 
particularly when each group contains considerable redundant features. 

To our knowledge, our theory for the grouped selection is the first of this kind. However, it has a 
root in feature selection. The large deviation approach used here is applicable to derive bounds for feature 
selection consistency. In such a situation, the result agrees with the necessary condition for feature selection 
consistency for any method, except for the constants independent of the sample size [19]. In other words, 
the required conditions are weaker than those for Li-regularization |21j . The use of the Hellinger-distancc 
is mainly to avoid specifying a sub-Gaussian tail of the random error. This means that the result continues 
to hold even when the error does not have a sub-Gaussian tail. 



4 Optimization Procedures 

As mentioned in Section [2] efficient computation of the convex subproblem (|6]) is of critical importance 
for the proposed DC algorithm. Note that ([6]) has an identical form of the constrained sparse group lasso 
problem: 



minimize — 1 1 Ax — y \\ \ 



l 

(7) 



subject to ||a;||i < s\ 

IMIg < s 2 

except that x is restricted to the two support sets. As to be shown in Section POl an algorithm for solving © 
can be obtained through only a few modifications on that of ([7]). Therefore, we first focus on solving ([7J). 



4.1 Accelerated Gradient Method 

For large-scale problems, the dimensionality of data can be very high, therefore first-order optimization is 
often preferred. We adapt the well-known accelerated gradient method (AGM) [THJ 12 , which is commonly 
used due to its fast convergence rate. 

To apply AGM to our formulation ([7]), the crucial step is to solve the following Sparse Group Lasso 
Projection (SGLP): 

minimize — II a; — v II \ 
x 2 

subject to ||cc||i<si (Ci) ( 8 ) 
\\x\\ G < s 2 (C 2 ), 

which is an Euclidean projection onto a convex set and a special case of (J7J when A is the identity. For 
convenience, let C\ and Ci denote the above two constraints in what follows. 

Since the AGM is a standard framework whose efficiency mainly depends on that of the projection step, 
we leave the detailed description of AGM in the supplement and introduce the efficient algorithm for this 
projection step ©. 



4.2 Efficient Projection 

We begin with some special cases of ((8|). If only C\ exists, ((8|) becomes the well-known Li-ball projection [9], 
whose optimal solution is denoted as ^^(v), standing for the projection of v onto the Li-ball with radius 
s\. On the other hand, if only C2 is involved, it becomes the group lasso projection, denoted as Vq . 
Moreover, we say a constraint is active, if and only if an equality holds at the optimal solution x*; otherwise, 
it is inactive. 

Preliminary results are summarized in Lemma [TJ 
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Lemma 1. Denote a global minimizer of ([5]) as x* . Then the following results hold: 

1. If both C\ and C 2 are inactive, then x* = v. 

2. If G\ is the only active constraint, i.e., ||aJ*||i = s\, ||a3*||(j < s 2 , then x* —V^iv) 
3- If Ci is the only active constraint, i.e., ||a5*||i < s\, ||cc*||g = s 2 , then x* = Vq(v) 

4.2.1 Computing x* from the optimal dual variables 

Lemma [1] describes a global minimizer when either constraint is inactive. Next we consider the case in which 
both C\ and C 2 are active. By the convex duality theory [B], there exist unique non- negative dual variables 
A* and rf such that x* is also the global minimizer of the following regularized problem: 

minimize - v\\% + A*||jc||i + r?*||a;||G, (9) 

whose solution is given by the following Theorem. 

Theorem 2 ([11). The optimal solution x* of ([9]) is given by 

x* Gz =mzx{\\v£\\ 2 -n\0}-?$- i = 1, 2, • • • , |G| (10) 

\\ v G,h 

where Vq, is computed via soft-thresholding fSjjj VQ i with threshold A* as follows: 

v% t = SGN{v Gz ) ■ max{|t; Gi | - A*,0}, 
where SGN(-) is the sign function and all the operations are taken element-wisely. 

Theorem [2] gives an analytical solution of x* in an ideal situation when the values of A* and rf are given. 
Unfortunately, this is not the case and the values of A* and n* need to be computed directly from (JSJ . Based 
on Theorem [2l we have the following conclusion characterizing the relations between the dual variables: 

Corollary 1. The following equations hold: 

|G| Ilt7^lli 

\n = 5>ax{n^n 2 - ?r,o}^^ = Sl (ii) 

i= i II u gJ|2 

|G| 

||a;l G = ^max{||t; G *|| 2 -7A0} = S2 . (12) 
»=i 

Suppose A* is given, then computing n* from (|12| amounts to solving a median finding problem, which 
can be done in linear time [5]. 

Finally, we treat the case of unknown A* (thus unknown rf ) . We propose an efficient bisection approach 
to compute it. 

4.2.2 Computing A*: bisection 

Given an initial guess (estimator) of A*, says A, one may perform bisection to locate the optimal A*, provided 
that there exists an oracle procedure indicating if the optimal value is greater than This bisection method 
can estimate A* in logarithm time. Next, we shall design an oracle procedure. 

2 An upper bound and a lower bound of A* should be provided in order to perform the bisection. These bounds can be easily 
derived from the assumption that both C\ and C2 are active. 
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Let the triples 

(sc*,A*,»7*) = SGLP(»,*i,aa) 

be the optimal solution of © with both constraints active, i.e., ||a3*||i = s%, \\x*\\g = s 2> with (A*, 77*) be 
the optimal dual variables. Consider the following two sparse group lasso projections: 

(x,\,rj) = SGLP(w,si,s 2 ), 
(x',\',r,') = SGLP(v, s[,s' 2 ). 

The following key result holds. 

Theorem 3. If A < A' and s 2 — s 2 , then si > s[. 

Theorem[3]gives the oracle procedure with its proof presented in the supplement. For a given estimator A, 
we compute its corresponding f\ from (jT^j) and then S\ from (|TTj) . satisfying (x, A, fj) = SGLP(t), si, s 2 ). Then 
si is compared with s\. Clearly, by Theorem [3l if s% < si, the estimator A is no less than A*. Otherwise, 
si > s\ means A < A*. In addition, from we know that s\ is a continuous function of A. Together with 
the monotonicity given in Theorem [3J a bisection approach can be employed to calculate A* . Algorithm [5] 
gives a detailed description. 



4.3 Solving Restricted version of (J7]) 

Finally, we modify the above procedures to compute the optimal solution of the restricted problem ([6]) . To 
apply the accelerated gradient method, we consider the following projection step: 

• • • X ii 112 
minimize — 1| x — v\\ 2 

subject to ||a; Tl ||i<si (Ci) ( 13 ) 
\\x T3 \\ G <s2 (C 2 ). 

Our first observation is: T 3 (x) C Ti(x), since if an element of x lies in a group whose L 2 -norm is less 
than r, then the absolute value of this element must also be less than r. Secondly, from the decomposable 
nature of the objective function, we conclude that: 

Vj ifje(T!) c 



u 3 

"3 



vf ifje Tt\T 3 , 



since there are no constraints on Xj if it is outside T\ and involves only the Li-norm constraint if j € Ti\Ta. 
Following routine calculations as in [5], we obtain the following results similar to (fTTj) and (|12p : 



\v}* 



Sl 



^max{|K|| 2 -^,0}^+ J2 «f (I 4 ) 
s 2 = J2 max{||^* Ha-^,0}. (15) 



i£T 2 



Based on (fl"4|) and p3|) . we design a similar bisection approach to compute A* and thus (x*) T3 , as in 
Algorithm O Details are deferred to the supplement. 
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Algorithm 2 Sparse Group Lasso Projection Algorithm 



Input: v, si, s 2 

Output: an optimal solution a; to the Sparse Group Projection Problem 
Function SGLP(t>, si, s 2 ) 

l: if ||a;||i < si and ||ac || q < s 2 then 

2: return v 

3: end if 

4: X Cl =V S 1 1 (V) 

5: a;c 2 = T>g(v) 

6: £Cci2= bisec(t>, si, s 2 ) 

7: if HxcJIg < s 2 then 

8: return a;^ 

9: else if ||ajc 2 ||i — s i then 
10: return xc 2 
11: else 

12: return xc 12 
13: end if 

Function bisec(v, si, s 2 ) 



1 


Initialize up, low and 


tol 




2 


while up — low > tol do 






3 


A = (low + up)/ 2 






i 


if (|12[) has a solution 


7) given 


v x 


5 


calculate s\ using rj 


and A. 




6 


if si < si then 






7 


up = A 






8 


else 






9 


Zow = A 






10 


end if 






11 


else 






12 


up = A 






13 


end if 






14 


end while 






15 


A* = up 






16 


Solve ([HJ to get r/* 






17 


Calculate a;* from A* and 


rf via ( 


m 


18 


return x* 







5 Significance 

This section is devoted to a brief discussion of advantages of our work statistically and computationally. 
Moreover, it explains why the proposed method is useful to perform efficient and interpretable feature 
selection with a given natural group structure. 

Interpretability. The parameters in ^ are highly interpretable in that Si and s 2 are upper bounds of 
the number of nonzero elements as well as that of groups. This is advantageous, especially in the presence 
of certain prior knowledge regarding the number of features and/or that of groups. However, such an 
interpretation vanishes with convex methods such as lasso or sparse group lasso, in which incorporating such 
prior knowledge often requires repeated trials of different parameters. 

Parameter tuning. Typically, tuning parameters for good generalization usually requires considerable 
amount work due to a large number of choices of parameters. However, tuning in ([T]) may search through 
integer values in a bounded range, and can be further simplified when certain prior knowledge is available. 
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Table 1: Running time (in seconds) of Dykstra's, ADMM and our projection algorithm. All three algorithms 
are averaged over 100 replications. 



Methods 


10* 




10* 


10 s 


10" 


Dykstra 
ADMM 

OURS 


0.1944 
0.0519 

< 10~ 7 


0.5894 
0.1098 
0.0002 


4.8702 
1.2000 
0.0051 


51.756 
26.240 
0.0440 


642.60 
633.00 
0.5827 



This permits more efficient tuning than its regularization counterpart. Based on our limited experience, we 
note that t does not need to be tuned precisely as we may fix at some small values. 

Performance and Computation. Although our model ((2]) is proposed as a computational surrogate of 
the ideal Lo-method, its performance can also be theoretically guaranteed, i.e., consistent feature selection 
can be achieved. Moreover, the computation of our model is much more efficient and applicable to large-scale 
applications. 



6 Empirical Evaluation 

This section performs numerical experiments to evaluate the proposed methods in terms of the efficiency 
and accuracy of sparse group feature selection. Evaluations are conducted on a PC with i7-2600 CPU, 8.0 
GB memory and 64-bit Windows operating system. 

6.1 Evaluation of Projection Algorithms 

Since the DC programming and the accelerated gradient methods are both standard, the efficiency of the 
proposed nonconvex formulation ([2]) depends on the projection step in ([8]). Therefore, we focus on evaluating 
the projection algorithms and comparing with two popular projection algorithms: Alternating Direction 
Multiplier Method (ADMM) [5] and Dykstra's projection algorithm [7] . We provide a detailed derivation of 
adapting these two algorithms to our formulation in the supplement. 

To evaluate the efficiency, we first generate the vector v whose entries are uniformly distributed in 
[—50,50] and the dimension of v, denoted as p, is chosen from the set {10 2 , 10 3 , 10 4 , 10 5 , 10 6 }. Next we 
partition the vector into 10 groups of equal size. Finally, S2 is set to 51og(p) and si, the radius of the 
i^-ball, is computed by -^^pS2 (motivated by the fact that Si < VTOS2). 

For a fair comparison, we run our projection algorithm until converge and record the minimal objective 
value as /*. Then we run ADMM and Dykstra's algorithm until their objective values become close to ours. 
More specifically, we terminate their iterations as soon as /admm — /* < 10 -3 and /Dykstra — /* < 10~ 3 , 
where /admm and /Dykstra stand for the objective value of ADMM and Dykstra's algorithm respectively. 
Table [T] summarizes the average running time of all three algorithms over 100 replications. 

Next we demonstrate the accuracy of our projection algorithm. Toward this end, the general convex 
optimization toolbox CVX is chosen as the baseline. Following the same strategy of generating data, 
we report the distance (computed from the Euclidean norm || • H2) between optimal solution of the three 
projection algorithms and that of the CVX. Note that the projection is strictly convex with a unique global 
optimal solution. 

For ADMM and Dykstra's algorithm, the termination criterion is that the relative difference of the 
objective values between consecutive iterations is less than a threshold value. Specifically, we terminate the 
iteration if |/(xfc_i) — f{xk)\ < 10 _7 /(ajfc_i). For our projection algorithm, we set the tol in Algorithm [5] to 
be 10 -7 . The results are summarized in Tabled Powered by second-order optimization algorithms, CVX can 
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provide fast and accurate solution for problems of moderate size but would suffer from great computational 
burden for large-scale ones. Therefore we only report the results up to 5,000 dimensions. 

Table 2: Distance between the optimal solution of projection algorithms and that of the CVX. All the results 
are averaged over 100 replications. 



Methods 


50 


100 


500 


1000 


5000 


Dykstra 
ADMM 
ours 


9.00 
0.64 
1.4e-3 


9.81 
0.08 
l.le-3 


11.40 
3.6e-3 
1.2e-3 


11.90 
6.3e-3 
1.7e-3 


12.42 
1.3e-2 
7.3e-3 



From Tables [T] and [2 we note that both ADMM and our algorithm yield more accurate solution than that 
of Dykstra's. For projections of moderate size, all three algorithms perform well. However, for large-scale 
ones, our advantage on efficiency is evident. 

6.2 Performance on Synthetic Data 

6.2.1 Experimental Setup 

We generate a 60 x 100 matrix A, whose entries follow i.i.d standard normal distribution. The 100 features 
(columns) are partitioned into 10 groups of equal size. The ground truth vector Xq possesses nonzero elements 
only in 4 of the 10 groups. To further enhance sparsity, in each nonzero group of Xq, only t (t < 10) elements 
are nonzero, where t is uniformly distributed from [1,5]. Finally y is generated according to Ax + z with 
z following distribution Af(0, 0.5 2 ), where A and y are divided into training and testing set of equal size. 

We fit our method to the training set and compare with lasso, group lasso and sparse group lasso. The 
tuning parameters of the convex methods are selected from {0.01,0.1,1,10}, whereas for our method, the 
number of nonzero groups is selected from the set {2,4,6,8} and the number of features is chosen from 
{2s2j 4s2, 6s2, 8S2}. Leave-one-out cross-validation is conducted over the training set for choosing the best 
tuning parameter for all the methods. 

6.2.2 Results and Discussions 

We use following metrics for evaluation: 

• Estimation error: \\x — a?o II 2 

• Prediction error: || — y|| | 

• Group precision: \T 2 (x) n T 2 (xo)\/\T 2 (x)\ 

• Group recall: \T 2 {x) n T 2 (x )\/\T 2 (xo)\ 

where x is the estimator obtained from ((2J and y is an independent vector following the same distribution 
as y. The group precision and recall demonstrate the capability of recovering the group structure from data. 
We report the results in Table [3] and observe that our model generally exhibits better performance. Note 
that although our model does not provide the best result on the metric of group recall, the group precision of 
our model is significantly better than the others, illustrating the fact that the three convex methods recover 
more redundant groups. 
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Table 3: Comparison of Performance on synthetic data, where glasso stands for the group lasso and sglasso 
denotes sparse group lasso. All the results are averaged for 100 replications. 



Methods 


ESTI. 


Pred. 


Prec. 


Reg. 


lasso 


4.7933 


151.05 


0.5212 


0.8700 


GLASSO 


8.1230 


244.53 


0.5843 


0.7575 


SGLASSO 


4.7649 


151.29 


0.5215 


0.8675 


OURS 


4.6617 


142.18 


0.7848 


0.6450 



6.3 Performance on Real-world Application 

Our method is further evaluated on the application of examining Electroencephalography (EEG) correlates 
of genetic predisposition to alcoholism [10] . EEG records the brain's spontaneous electrical activity by 
measuring the voltage fluctuations over multiple electrodes placed on the scalp. This technology has been 
widely used in clinical diagnosis, such as coma, brain death and genetic predisposition to alcoholism. In fact, 
encoded in the EEG data is a certain group structure, since each electrode records the electrical activity of a 
certain region of the scalp. Identifying and utilizing such spatial information has the potential of increasing 
stability of a prediction. 

The training set contains 200 samples of 16384 dimensions, sampled from 64 electrodes placed on subject's 
scalps at 256 Hz (3.9-msec epoch) for 1 second. Therefore, the data can naturally be divided into 64 groups 
of size 256. We apply the lasso, group lasso, sparse group lasso and our method on the training set and 
adapt the 5-fold cross-validation for selecting tuning parameters. More specifically, for lasso and group lasso, 
the candidate tuning parameters are specified by 10 parameters^] sampled using the logarithmic scale from 
the parameter spaces, while for the sparse group lasso, the parameters form a 10 x 10 gricQ, sampled from 
the parameter space in logarithmic scale. For our method, the number of groups is selected from the set: 
s-2 = {30, 40, 50} and si, the number of features is chosen from the set {50s2, 100s2, 150s2j-- The accuracy of 
classification together with the number of selected features and groups over a test set, which also contains 
200 samples, are reported in Table 0] Clearly our method achieves the best performance of classification 
with the least number of groups. Note that, although lasso's performance is almost as good as ours with 
even less features, however, it fails to identify the underlying group structure in the data, as revealed by the 
fact all 64 groups are selected. 

Table 4: Comparison of performance on EEG data, where glasso stands for group lasso and sglasso denotes 
sparse group lasso. 



Methods 


Accuracy 


# Feature 


# Group 


lasso 


67.0 


2068 


64 


glasso 


62.5 


8704 


34 


sglasso 


65.5 


4834 


61 


ours 


68.0 


3890 


25 



7 Conclusion and Future Work 

This paper expands a nonconvex paradigm into sparse group feature selection. In particular, theoretical 
properties on the accuracy of selection and parameter estimation are analyzed. In addition, an efficient 
optimization scheme is developed based on the DC programming, accelerated gradient method and efficient 

3 ^iasso = logspace(10 -3 , 1), A glasso = logspace(10 -2 , 1) 
4 The product space of A lasso X A glasso 
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projection. The efficiency and efficacy of the proposed method are validated on both synthetic data and 
real-world applications. 

The proposed method will be further investigated on real-world applications involving the group struc- 
ture. Moreover, extending the proposed model to multi-modal multi-task learning |25j is another promising 
direction. 
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Supplementary Material for paper: 
Efficient Sparse Group Feature Selection via 
Nonconvex Optimization 



1 Proof of Theorem [T] 

The proof uses a large deviation probability inequality of [35] to treat one-sided log-likelihood ratios with 
constraints. 

Let S = {x T : \\x T \\ < s?, || £c T " || ,g < ||£c|| = Y%=i I (\ x j\ ^ °) is tnc L a -novm of x, and ||a;||o,G = 
Ej=i I\\\ x j\\2 7^ 0) is the Lo-norm over the groups. Now we partition S. Note that for C C (Gi, • • • , G| G |), 
it can be partitioned into C = (C \ C°) U (C n C°). Then 

»S 

5 =U U <W, 

where 5 Ac , c = {x r G 5 : C(x) =C= (G n ,--- ^J^. L4 G .| < and = {G ^ G : |G° \ G| = 

i, \c\ < s °}, with \Bi \ = ( s ^.) e; =0 ( |g| 7 s °); i = 0, •• • , 4 

To bound the error probability, let L{x) = — \\\Ax — y\\ 2 be the likelihood. Note that 
{x ^ x°} C {i(ac) - £0°) > 0} C - L(a; ) > 0}. 

This together with {x x°} C {x £ S} implies that 

{x ^ x°} C {L(x) - L{x°) >0}D{xe S}. 

Consequently, 

I = P(x ^ x°) 

< p(.L(x) - i(a; ) >0;ies) 

»S 

<EE E P *( SU P (L(x)-L(x°))>0) 

s° s° 

^EE E P *( SU P (Z(x)-L(x°))>o), 

»=1 J=l \C\=i,\Ao\=j {-log(l-h 2 (x,x«))>max(i,l)C nl i n (x")-d 3 T d 2 p , x eSA c ,c} 

where P* is the outer measure and the last two inequalities use the fact that Sa c c £ {x € <Sa c c : 
max(|G° \ C|, l)G min (x°) < - log(l - h 2 (x, x ))} C {- log(l - fc 2 (x, a; )) > d x max(i, l)G min (x°) - d 3 T d *p}, 
under Assumption [3] 

For /, we apply Theorem 1 of [22 to bound each term. Towards this end, we verify their entropy 
condition (3.1) for the local entropy over Sa c ,C for |G| = 1, • • • , s° and \A\ = 1, • • • , s". Under Assumption^ 

£ = £n,p = {2co) 1 / 2 c i 1 log(2 1 / 2 /c3) logp(^) 1 / 2 satisfies there with respect to e > 0, that is, 

2 1 / 2 S 

sup / ff 1 / 2 (t/ C 3,-F Jl )*<% /2 2 1/2 £log(2/2 1 / 2 c 3 )<c 4 n 1 / 2 £ 2 . (16) 

{0<|A|<p } J2-s e 2 
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for some constant C3 > and C4 > 0, say C3 = 10 and C4 = — ■ By Assumption [21 C m [ n (x°) > e 2 lVQV 
implies JIB]), provided that s\ > {2c ) 1 / 2 c i 1 log(2 1 / 2 /c 3 ). 

Note that IB, I = ( f fi ) V* = n ( |G| 7 S °) < {\G\(\G\ - < (|G| 2 /4) 4 by the binomial coefficients formula. 

Moreover, Ylj=i 2 J 'i^ < i s °, and hj =i (1 —j )^ = (^*)^ using the Multinomial Theorem. By Theorem 

1 of [22], there exists a constant C2 > 0, say C2 = nJlMe' 

s° s° 

I < El^lE E (■ J ■ )2 jl ---V i exp(-c 2 niC min (x )) 

4 

< ex P ( - c 2mC min (x°) + 2i(log |G| + logs?)) 

i=l 

< exp ( - c 2 nC min {x°) + 2(log |G| + log s?)) . 

Let G={£^ A }. For the risk property, Eh 2 (x, x°) = Eh 2 (x°, x°) + Eh 2 (x, x°)I(G) is upper bounded 

by 

Eh 2 (x, x°) + exp ( - c 2li C min (s°) + 2(log \G\ + logs?)) = (1 + o(l))Eh 2 (x°, a; ), 
using the fact that h(x,x°) < 1. This completes the proof. 

2 Proof of Theorem [3] 

We utilize an intermediate lemma from [4]: 

Lemma 2. Let X be a metric space and U be a normed space. Suppose that for all x G X , the function 
tp{x,-) is differ entiable and that ip(x,Y) and Dy^Pix^) (the partial derivative of i)(x,Y) with respect to 
Y) are continuous on X x U . Let $ be a compact subset of X. Define the optimal value function as 
4>(Y) = inf^g^ i>{x, Y). The optimal value function <p(Y) is directionally differentiable. In addition, if for 
any Y € U , ip(-,Y) has a unique minimizer x(Y) over then (f>(Y) is differentiable at Y and the gradient 
of<j)(Y) is given by <p'(Y) = D Y ip(x(Y),Y). 



Proof of Theorem^ For the proof, an intermediate lemma will be used, with its details given in the Ap- 
pendix. Since both constraints are active, if (x, A, 77) = SGLP(u, s±, s 2 ), then x and A are also the optimal 
solutions to the following problem: 

maximize minimize ip(x, A) = -||x — + A(||a;||i — Sx), 

A x£X 2 

where X — {x : \\x\\q < s 2 }. By LemmaO 0(A) = m£ xe x "4>{x, A) is differentiable with the derivative given 
by 1 1 set In addition, as a pointwise infimum of a concave function, so does </>(A) [5J and its derivative, 
||x||i, is non-increasing. Therefore si = \\x\\i is non-decreasing as A becomes smaller. This completes the 
proof. □ 

3 Algorithm for Solving ( |T3l ) 

We give a detailed description of algorithm for solving the restricted projection (fT"3|) in Algorithm |3] 
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Algorithm 3 Restricted Sparse Group Lasso Projection Algorithm 



Input: v, si, s 2 , T u T 3 

Output: an optimal solution a; to the Restricted Sparse Group Projection Problem (|13[) 
Function RSGLP(v, si, s 2 , T x , T 3 ) 

1: if ||a; Tl ||i < si and ||x T3 || G < s 2 then 

2: return u 

3: end if 

4: xg°° = vW, X T C \ = V{ X {v T ^ 
5: xg» )0 =«( T ») e , X%=V S J{V T °) 

6: xgf = w( r e , x T c \= bisect, Si, S 2 , ?!, T 3 ) 

7: if ||x£ 3 J G < «2 then 

8: return Xd 

9: else if ||x^||i < s\ then 
10: return xc 2 
11: else 

12: return xc 12 
13: end if 

Function bisec(v, si, s%, Ti, T 3 ) 
1: Initialize up, low and toZ 
2: while itp — Zow > tol do 

3: A = (/ow + up)/2 

4: if (|15[) has a solution f\ given v x then 
5: calculate s~i using fj and A. 
6: if si < si then 

7: up = A 

8: else 

9: low — A 

10: end if 
11: else 

12: up = A 

13: end if 
14: end while 

15: A* = up 

16: Solve (JTSJ) to get r]* 

17: Calculate (x*) Tl from A* and rj* . 

18: return (x*) Tl 



4 Accelerated Gradient Method 



The AGM procedure is listed in Algorithms IU in which f(x) is the objective function ^||j4:e — with 
V/ (x) denotes its gradient at x. In addition, fL, u {x) is the linearization of /(x) at u defined as follows: 

f L , u (x) = /(«) + V/(tx) T (x - u) + ~||z - 
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Algorithm 4 Accelerated Gradient Method [151 [2] for (|7| 
Input: A, y, s x , s 2 , L , x , 
Output: solution x to ||7J) 

1: Initialize: Lo, a^i = xq, a_j = 0, ao = 1, t = 0. 

2: repeat 

3: * = t + l,A = ^= i ,« t = a;t+ i 8t(» t -a!t_i) 

4: Line search: Find the smallest L = 2 3 L t _ 1 such that 

f(x t +l) < fL,u t (x t+ l), 

where x t+1 = SGLP(« t - ±Vf(u t ), Si,s 2 ) 

5: Ctt+l — — , L t = L. 

6: until Converge 
7: return x t 



5 ADMM Projection algorithm 



ADMM is widely chosen for its capability of decomposing coupled variables/constraints, which is exactly 
the case in our projection problem. Before applying ADMM, we transform (JHJ) into an equivalent form as 
follows: 

... 1.. 



minimize —\\x — v 



2 

subject to ||m||i < s\ 
\\w\\g < s 2 

U = X,W = X. 

The augmented Lagrangian is: 

C(x, \,rj) = - \\x — v\\l + \ T (u — x) + rj T (w — x) 
+ ^\\u-x\\l+\\w-x\\l). 

Utilize the scaled form [5], i.e., let A = — , r/ = 2, we can obtain an equivalent augmented Lagrangian: 

C(x,\,rj) = ^\\x - v\\ 2 2 + ^(\\x - u - \\\ 2 2 + \\x - w - f]\\l) 
-f(l|A||| + NII)- 

Now we calculate the optimal x, A and r\ through alternating minimization. For fixed u and w, the 
optimal x possesses a closed-form solution: 

x = 1 (v + p{u + A + W + T))) . 
1 + 2p 

For fixed x and u, finding the optimal w is a group lasso projection: 

minimize ~\\w - (x - 

w i (17) 

subject to ||w||g < s i 
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For fixed x and w, finding the optimal u amounts to solve an Li-ball projection: 

minimize -\\u - (x - X)\\l 

u 2 (18) 

subject to || xx. || i < s\. 
The update of multipliers is standard as follows: 

A = A + u — x 

(19) 

rj = r] + w x 

Algorithm [5] summarizes the above procedure. Note that, the value of the penalty term p is fixed in 
Algorithm[5] However, in our implementation, we increase p whenever necessary to obtain faster convergence. 



Algorithm 5 ADMM [5] for © 
Input: v, si, S2 

Output: an optimal solution x to (|8j) 
1: Initialize: xq, ito, wq, Ao, rjo, t = 0, p > 
2: repeat 

3: t = t+l 

4: X t = (V + p(u t -l + A 4 _i + W t -1 + Vt-l)) 

5: W t = V S (?(xt - IJt-i) 

6: U t = Vl 1 (x t -\t-l) 

7: A t = A t _i + U t - X t , T] t = Vt-l + W t ~ X t . 

8: until Converge 
9: return x t 



6 Dykstra's Algorithm 

Dykstra's algorithm is a general scheme to compute the projection onto intersections of convex sets. It 
is carried out by taking Euclidean projections onto each convex set alternatively in a smart way and is 
guaranteed to converge for least squares objective function 0. The details of applying Dykstra's Algorithm 
to our projection problem are listed in Algorithm [6] 



Algorithm 6 Dykstra's Algorithm [7] for © 
Input: v, si, S2 

Output: an optimal solution x to {8} 
1: Initialize: x$ — v, po = 0, qo = 0, t — 
2: repeat 

3: t = t+l 

4: Vt-l = V S (?{xt-l +Pt-l) 

5: p t = X t -l +Pt-1 - Vt-l 

6: X t = VI 1 {yt-l + qt-l) 

7: q t = y t -i + qt-i - x t 
8: until Converge 
9: return Xt 
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